The exponential growth of patent filings in technology-intensive areas has made manual prior-art searches increasingly unreliable, often leading to inadvertent infringement and duplicated innovation. This paper introduces a Patent Similarity Checker, a computational tool that employs Term Frequency-Inverse Document Frequency (TF-IDF) vectorization and Cosine Similarity scoring to help identify prior art and assess Intellectual Property Rights (IPR) risk at the pre-filing stage. The system processes patent documents through a pipeline of tokenization, stop-word removal, and stemming before building a weighted document-term matrix. Pairwise Cosine Similarity scores are calculated across the corpus to quantify the textual proximity between a candidate patent and existing filings. These scores are then mapped onto a tiered risk scale of low, moderate, or high, giving inventors and legal practitioners a reproducible, data-backed basis for prosecution decisions. Beyond the technical perspective, the paper situates the tool within Indian patent law, in particular the novelty and inventive-step requirements of the Patents Act, 1970, and the related TRIPS obligations. It considers whether algorithmic similarity scores may be used to aid examiner judgement in prior-art determinations, and whether they may carry evidentiary value in opposition or infringement proceedings. It also examines the limits of purely textual matching, especially its inability to capture claim scope, functional equivalents, or drawing-based disclosures, and argues that the tool is best understood as an accessible first-pass filter rather than a replacement for expert legal analysis. In conclusion, the study supports the controlled use of NLP-based approaches in patent practice as a scalable means of mitigating IPR disputes ex ante.
Introduction
This paper presents a Patent Similarity Checker, a system designed to cope with the growing number of patent applications and the difficulty of identifying prior art manually. It proposes an automated approach that uses NLP and information retrieval techniques to measure the textual similarity between patent documents and to classify them into risk categories framed by the Indian Patents Act, 1970.
The system processes patent texts (PDF, DOCX, TXT) through an NLP pipeline involving tokenization, stop-word removal (including patent-specific terms), and stemming. After preprocessing, documents are converted into numerical vectors using TF-IDF weighting, which reflects term importance across a corpus of patents.
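The preprocessing and weighting steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the stop-word list, the `crude_stem` suffix stripper (a simplified stand-in for a full stemmer), the smoothed IDF formula, and the three sample sentences are all hypothetical, and file parsing for PDF or DOCX input is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list: a few general English words plus patent boilerplate.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "is", "in", "for", "with",
              "said", "wherein", "claim", "claims", "embodiment", "invention"}


def crude_stem(token):
    """Strip a few common suffixes; a simplified stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of term j in document i."""
    tokenized = [preprocess(d) for d in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    # Assumed smoothed IDF, so terms found in every document keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1
        rows.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, rows


# Hypothetical three-document corpus.
corpus = [
    "A locking pin for securing the hinged panels of a solar array.",
    "The invention relates to a locking pin securing a hinged door panel.",
    "A method of brewing coffee with heated water.",
]
vocabulary, matrix = tfidf_matrix(corpus)
```

In this sketch a term shared by two of the three documents, such as the stem `lock`, receives a lower weight than a term unique to one document, such as `solar`, which is the behaviour TF-IDF weighting is meant to provide.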
Similarity between patents is calculated using cosine similarity, producing a score between 0 and 1 that indicates how closely two documents match. These scores are organized into a similarity matrix to compare multiple patents at once.
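As an illustration of this step, the sketch below computes cosine similarity, the dot product of two vectors divided by the product of their Euclidean norms, and arranges the pairwise scores in a symmetric matrix. The three input vectors are made-up values rather than real TF-IDF output. Because TF-IDF weights are non-negative, the resulting scores fall between 0 and 1.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares document i with document j."""
    n = len(vectors)
    return [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical weighted vectors over a three-term vocabulary.
vectors = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]
sim = similarity_matrix(vectors)
```

Here documents 0 and 1 share one of their two terms and score 0.5, while documents 0 and 2 share none and score 0.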
Based on similarity values, patents are mapped into a three-level IPR risk classification, helping identify potential prior art conflicts and reducing the risk of incorrect patent grants. The system integrates computational methods with legal standards (including TRIPS and Indian patent law) to support faster, more consistent patent examination.
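A three-level mapping of this kind might look like the sketch below. The cutoffs of 0.30 and 0.60 are assumptions chosen purely for illustration, since this section does not state the tool's actual thresholds. The function names are likewise hypothetical. The candidate is classified by its highest similarity to any existing filing.

```python
# Assumed cutoffs for illustration only; the tool's real thresholds are not stated here.
LOW_CUTOFF = 0.30
HIGH_CUTOFF = 0.60


def classify_risk(score):
    """Map a cosine similarity in [0, 1] onto a three-level risk label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("cosine similarity must lie in [0, 1]")
    if score >= HIGH_CUTOFF:
        return "High Risk"
    if score >= LOW_CUTOFF:
        return "Moderate Risk"
    return "Low Risk"


def assess_candidate(scores_against_corpus):
    """Classify a candidate by its maximum similarity to any existing filing."""
    top = max(scores_against_corpus)
    return top, classify_risk(top)


# Hypothetical scores of one candidate against three existing filings.
top_score, label = assess_candidate([0.12, 0.4551, 0.08])
```

Under these assumed cutoffs, a maximum score of 0.4551 falls in the middle tier.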
Conclusion
This paper has shown how an Automated Patent Similarity Checker can be designed and deployed as an ex-ante compliance instrument for both the Indian and international patent systems. By combining a deterministic pipeline of tokenization, stop-word elimination, and Porter stemming with the TF-IDF vector space model, the tool offers a formal, repeatable method of detecting surface-level textual similarity. When benchmarked on a test document, the system reported a maximum pairwise Cosine Similarity of 45.51%, which placed the document in the “Moderate Risk” category. This objective and replicable classification gives inventors and corporate legal teams a useful reference point for pre-filing evaluation, alerting them to possible novelty problems under Section 2(1)(l) of the Patents Act, 1970, before they embark on costly and time-consuming formal proceedings.
Nonetheless, the practical application of the model also highlights limitations inherent in any approach that relies solely on surface-level lexical matching. The most significant weakness of a sparse vector model such as simple TF-IDF is that it cannot recognize functional similarity expressed in different words. Because the algorithm depends on exact overlap of token stems, two patent documents may describe the same technology or design principle using entirely different vocabulary (such as “fastening element” and “locking pin”), yielding a low cosine similarity even though one document fully anticipates the other.
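The vocabulary-mismatch problem can be demonstrated with a minimal sketch. The three phrases below are hypothetical, and raw term counts are used in place of TF-IDF weights for brevity. Reweighting would not change the first result, because no weighting scheme can create overlap between documents that share no terms.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Bag-of-words term counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical phrases: A and B describe the same mechanism in different words;
# C reuses most of A's wording.
doc_a = term_counts("fastening element retains hinged cover")
doc_b = term_counts("locking pin holds pivoting lid")
doc_c = term_counts("locking pin retains hinged cover")

score_synonyms = cosine(doc_a, doc_b)  # no shared tokens
score_overlap = cosine(doc_a, doc_c)   # three of five tokens shared
```

The two functionally equivalent descriptions score 0, whereas the pair that shares wording scores 0.6, which is the opposite of what a prior-art search would want for the first pair.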
In addition, the present system operates entirely on text, so it cannot interpret non-textual material such as technical illustrations, flow diagrams, chemical formulae, or advanced mathematical equations, any of which may carry the essence of the invention. As a result, although the Patent Similarity Checker is a scalable, fast, and impartial preliminary screening tool, it should be regarded as a decision-support aid rather than an independent substitute for the nuanced, expert interpretation of a patent attorney or examiner.